What is IPython Notebook?

Interactive python shell runs in the background and communicates with a browser, supports Matplots, HTML, JS, TeX markup, images inline

Similar to Mathematica, crucial concept is that it allows you to query data interactively and quickly, saving your input and output

IPython site

I am using WinPython which is a portable Python environment.

There are others such as enthought, python x,y.



In [ ]:

    
# display help
# press ctrl + m + h



In [1]:

    
print 'hello world!'









    



hello world!

We can insert cells at any position, remove cells, copy and paste entire cells

supports path completion (on linux and macs only), command completion, docstring and will open the source code (if available)

also supports shell commands



In [2]:

    
# ping and other shell commands are supported, prefix with !
!ping www.quantel.com









    



Pinging www.quantel.com [10.54.1.107] with 32 bytes of data:
Reply from 10.54.1.107: bytes=32 time=1ms TTL=64
Reply from 10.54.1.107: bytes=32 time<1ms TTL=64
Reply from 10.54.1.107: bytes=32 time<1ms TTL=64
Reply from 10.54.1.107: bytes=32 time<1ms TTL=64

Ping statistics for 10.54.1.107:
    Packets: Sent = 4, Received = 4, Lost = 0 (0% loss),
Approximate round trip times in milli-seconds:
    Minimum = 0ms, Maximum = 1ms, Average = 0ms



In [3]:

    
# display docstring
enumerate?



In [4]:

    
# can also open the source file
from pandas import *
read_csv

supports 'magic' functions



In [6]:

    
def for_loop():
    ret = []
    for x in range(10000):
        ret.append(x)
    return ret

def list_comp():
    return [x for x in range(10000)]

# which is faster?
%timeit for_loop()
%timeit list_comp()









    



100 loops, best of 3: 2.15 ms per loop
1000 loops, best of 3: 729 us per loop



In [7]:

    
# list of all magic functions
%lsmagic









    



Available line magics:
%alias  %alias_magic  %autocall  %automagic  %bookmark  %cd  %clear  %cls  %colors  %config  %connect_info  %debug  %dhist  %dirs  %doctest_mode  %ed  %edit  %env  %gui  %hist  %history  %install_default_config  %install_ext  %install_profiles  %killbgscripts  %less  %load  %load_ext  %loadpy  %logoff  %logon  %logstart  %logstate  %logstop  %lsmagic  %macro  %magic  %more  %notebook  %page  %pastebin  %pdb  %pdef  %pdoc  %pfile  %pinfo  %pinfo2  %popd  %pprint  %precision  %profile  %prun  %psearch  %psource  %pushd  %pwd  %pycat  %pylab  %qtconsole  %quickref  %recall  %rehashx  %reload_ext  %rep  %rerun  %reset  %reset_selective  %run  %save  %sc  %store  %sx  %system  %tb  %time  %timeit  %unalias  %unload_ext  %who  %who_ls  %whos  %xdel  %xmode

Available cell magics:
%%!  %%capture  %%cmd  %%file  %%powershell  %%prun  %%script  %%sx  %%system  %%timeit

Automagic is ON, % prefix IS NOT needed for line magics.

notebook about magic commands

supports inline display of plots



In [8]:

    
plot(rand(100))









    Out[8]:





[<matplotlib.lines.Line2D at 0x87505c0>]

and images



In [9]:

    
from IPython.core.display import Image
#Image(filename=r'Desktop\kid_f_off.jpg')
Image(filename=r'Downloads\Happy_Holidays.jpg')









    



---------------------------------------------------------------------------
IOError                                   Traceback (most recent call last)
<ipython-input-9-6d1a920a8ad5> in <module>()
      1 from IPython.core.display import Image
      2 #Image(filename=r'Desktop\kid_f_off.jpg')
----> 3 Image(filename=r'Downloads\Happy_Holidays.jpg')

C:\WinPython-64bit-2.7.5.3\python-2.7.5.amd64\lib\site-packages\IPython\core\display.pyc in __init__(self, data, url, filename, format, embed)
    478         self.format = unicode(format).lower()
    479         self.embed = embed if embed is not None else (url is None)
--> 480         super(Image, self).__init__(data=data, url=url, filename=filename)
    481 
    482     def reload(self):

C:\WinPython-64bit-2.7.5.3\python-2.7.5.amd64\lib\site-packages\IPython\core\display.pyc in __init__(self, data, url, filename)
    263             self.url = url
    264             self.filename = None if filename is None else unicode(filename)
--> 265         self.reload()
    266 
    267     def reload(self):

C:\WinPython-64bit-2.7.5.3\python-2.7.5.amd64\lib\site-packages\IPython\core\display.pyc in reload(self)
    483         """Reload the raw data from file or URL."""
    484         if self.embed:
--> 485             super(Image,self).reload()
    486 
    487     def _repr_html_(self):

C:\WinPython-64bit-2.7.5.3\python-2.7.5.amd64\lib\site-packages\IPython\core\display.pyc in reload(self)
    268         """Reload the raw data from file or URL."""
    269         if self.filename is not None:
--> 270             with open(self.filename, self._read_flags) as f:
    271                 self.data = f.read()
    272         elif self.url is not None:

IOError: [Errno 2] No such file or directory: u'Downloads\\Happy_Holidays.jpg'



In [10]:

    
# the above was a delibarate error to demonstrate traceback output
#Image(filename=r'C:\Users\alanwo\Desktop\kid_f_off.jpg')
Image(filename=r'C:\Downloads\Happy_Holidays.jpg')









    Out[10]:

display youtube videos too!



In [11]:

    
from IPython.lib.display import YouTubeVideo
YouTubeVideo('tGvHNNOLnCk')









    Out[11]:

and boring math

$\left( \sum_{k=1}^n a_k b_k \right)^2 \leq \left( \sum_{k=1}^n a_k^2 \right) \left( \sum_{k=1}^n b_k^2 \right)$

It also supports mixing python with other languages such as cython, R, Octave.

Scripting with Bash, Perl and Ruby are also supported.

Clustering is supported, multiple instances can be launched and the work divided between them.

What is Pandas?

Pandas consists of:

• A set of labeled array data structures, the primary of which are Series/TimeSeries and DataFrame

• Index objects enabling both simple axis indexing and multi-level (lookups, data alignment, and reindexing)

• An integrated group by engine for aggregating and transforming data sets

• Date range generation (date_range)

• Input/Output tools: loading tabular data from flat files (CSV, delimited, Excel), and saving and loading pandas objects from the fast and efficient PyTables/HDF5 format.

• Memory-efficent “sparse” versions of the standard data structures for storing data that is mostly missing or mostly constant (some fixed value)

• Moving window statistics (rolling mean, rolling standard deviation, etc.)

• Static and moving window linear and panel regression

Pandas documentation

If you can't be bothered to read the above then you can think of Pandas as an in memory DB tool (but it is much more than this)

Some simple examples



In [12]:

    
from pandas import *
# Series
# s = Series(data, index=index)
# data => a Python dict, an ndarray, a scalar value 
# NaN for missing data
sample_series = Series(randn(5))
sample_series









    Out[12]:





0   -0.169314
1    0.972852
2    0.173603
3    0.549859
4   -1.262332
dtype: float64



In [13]:

    
# we can pass an index list for the rows
sample_series = Series(randn(5), index=['A','B','C','D','E'])
sample_series









    Out[13]:





A    0.267264
B   -0.821806
C    2.045298
D   -0.316652
E    1.313129
dtype: float64

Series can be thought of as 1-d arrays (really a 1-d numpy ndarray underneath)

They are the fundamental datatype in Pandas



In [14]:

    
# calling .values decomposes to the underlying data
sample_series.values









    Out[14]:





array([ 0.26726378, -0.82180644,  2.04529782, -0.31665223,  1.31312935])



In [15]:

    
# some simple indexing
sample_series[0]









    Out[15]:





0.26726378406431067



In [16]:

    
sample_series[2]









    Out[16]:





2.0452978176194083



In [17]:

    
# DataFrame is one of : 
#  Dict of 1D ndarrays, lists, dicts, or Series
#  2-D numpy.ndarray
#  Structured or record ndarray
#  A Series
#  Another DataFrame
sample_df = DataFrame(randn(5))
sample_df



In [18]:

    
# again we can pass the column and row names
sample_df = DataFrame({'A':randn(5)}, index=['1', 'c', '5', 'g', 't']) # use abritrary row values
sample_df



In [19]:

    
# data can consist of anything and can be hetereogenous for a column
from datetime import datetime
sample_df = DataFrame( {'A':['hello', 9, 12312.00, datetime.now(), [x for x in range(5)]]})
sample_df









    Out[19]:






  
    
      
      A
    
  
  
    
      0
                            hello
    
    
      1
                                9
    
    
      2
                            12312
    
    
      3
       2013-09-20 15:17:20.578000
    
    
      4
                  [0, 1, 2, 3, 4]



In [20]:

    
# we can construct from a dict of Series
d = {'col1':Series(randn(5), index=['a','b','c','d','e']), 'col2':Series(randn(4), index=['a','b','c','d'])}
d









    Out[20]:





{'col1': a    0.470946
b    0.585088
c    1.641423
d   -0.207128
e   -0.803649
dtype: float64,
 'col2': a    0.017838
b    0.678610
c   -0.088913
d   -0.738760
dtype: float64}



In [21]:

    
sample_df = DataFrame(d)
sample_df



In [22]:

    
# column selection
sample_df['col1']









    Out[22]:





a    0.470946
b    0.585088
c    1.641423
d   -0.207128
e   -0.803649
Name: col1, dtype: float64



In [23]:

    
# this is really a series
type(sample_df['col1'])









    Out[23]:





pandas.core.series.Series



In [24]:

    
# different to this, which is a dataframe
type(sample_df[['col1']])









    Out[24]:





pandas.core.frame.DataFrame



In [25]:

    
# row selection in dataframes using fancy indexing
sample_df.ix['a']









    Out[25]:





col1    0.470946
col2    0.017838
Name: a, dtype: float64



In [26]:

    
# can also use irow, integer based indexing
sample_df.irow(1)









    Out[26]:





col1    0.585088
col2    0.678610
Name: b, dtype: float64



In [27]:

    
# add a column
sample_df['new_col'] = randn(5)
sample_df



In [28]:

    
# can compute new values based on other cols
sample_df['diff'] = sample_df['col1'] - sample_df['col2']
sample_df



In [29]:

    
from datetime import datetime
sample_df['date'] = datetime.now()
sample_df









    Out[29]:






  
    
      
      col1
      col2
      new_col
      diff
      date
    
  
  
    
      a
       0.470946
       0.017838
      -0.551610
       0.453109
      2013-09-20 15:20:05.470000
    
    
      b
       0.585088
       0.678610
      -0.424923
      -0.093522
      2013-09-20 15:20:05.470000
    
    
      c
       1.641423
      -0.088913
       1.158524
       1.730336
      2013-09-20 15:20:05.470000
    
    
      d
      -0.207128
      -0.738760
      -1.506391
       0.531632
      2013-09-20 15:20:05.470000
    
    
      e
      -0.803649
            NaN
      -0.754853
            NaN
      2013-09-20 15:20:05.470000



In [30]:

    
# we can apply functions to a column, row or entire dataframe
sample_df['day'] = sample_df['date'].apply(lambda x: x.day)
sample_df









    Out[30]:






  
    
      
      col1
      col2
      new_col
      diff
      date
      day
    
  
  
    
      a
       0.470946
       0.017838
      -0.551610
       0.453109
      2013-09-20 15:20:05.470000
       20
    
    
      b
       0.585088
       0.678610
      -0.424923
      -0.093522
      2013-09-20 15:20:05.470000
       20
    
    
      c
       1.641423
      -0.088913
       1.158524
       1.730336
      2013-09-20 15:20:05.470000
       20
    
    
      d
      -0.207128
      -0.738760
      -1.506391
       0.531632
      2013-09-20 15:20:05.470000
       20
    
    
      e
      -0.803649
            NaN
      -0.754853
            NaN
      2013-09-20 15:20:05.470000
       20



In [31]:

    
# note that NaN values are propagated, we can deal with this in a few ways
# drop them

copy = sample_df.dropna()
copy









    Out[31]:






  
    
      
      col1
      col2
      new_col
      diff
      date
      day
    
  
  
    
      a
       0.470946
       0.017838
      -0.551610
       0.453109
      2013-09-20 15:20:05.470000
       20
    
    
      b
       0.585088
       0.678610
      -0.424923
      -0.093522
      2013-09-20 15:20:05.470000
       20
    
    
      c
       1.641423
      -0.088913
       1.158524
       1.730336
      2013-09-20 15:20:05.470000
       20
    
    
      d
      -0.207128
      -0.738760
      -1.506391
       0.531632
      2013-09-20 15:20:05.470000
       20



In [32]:

    
# or replace with missing values
copy = sample_df
copy.fillna(0, inplace=True)
copy









    Out[32]:






  
    
      
      col1
      col2
      new_col
      diff
      date
      day
    
  
  
    
      a
       0.470946
       0.017838
      -0.551610
       0.453109
      2013-09-20 15:20:05.470000
       20
    
    
      b
       0.585088
       0.678610
      -0.424923
      -0.093522
      2013-09-20 15:20:05.470000
       20
    
    
      c
       1.641423
      -0.088913
       1.158524
       1.730336
      2013-09-20 15:20:05.470000
       20
    
    
      d
      -0.207128
      -0.738760
      -1.506391
       0.531632
      2013-09-20 15:20:05.470000
       20
    
    
      e
      -0.803649
       0.000000
      -0.754853
       0.000000
      2013-09-20 15:20:05.470000
       20



In [33]:

    
# we can select multiple columns
copy = sample_df[['col1', 'col2']]
copy



In [34]:

    
# or a cross section
copy = sample_df[['col1', 'col2']][1:4]
copy
# note that for index selection [beg, end) - beg is included, end is excluded



In [35]:

    
# we can also do label selection on rows and slice the columns
copy = sample_df.loc['b':, 'col1':'new_col']
copy



In [36]:

    
# we can create a boolean mask
sample_df['col1']>0









    Out[36]:





a     True
b     True
c     True
d    False
e    False
Name: col1, dtype: bool



In [37]:

    
# then use this for selection
copy = sample_df[sample_df['col1']>0]
copy









    Out[37]:






  
    
      
      col1
      col2
      new_col
      diff
      date
      day
    
  
  
    
      a
       0.470946
       0.017838
      -0.551610
       0.453109
      2013-09-20 15:20:05.470000
       20
    
    
      b
       0.585088
       0.678610
      -0.424923
      -0.093522
      2013-09-20 15:20:05.470000
       20
    
    
      c
       1.641423
      -0.088913
       1.158524
       1.730336
      2013-09-20 15:20:05.470000
       20



In [38]:

    
# can also use more advanced criteria
sample_df.ix[(sample_df['col1'] < 0) & (sample_df['diff']<0)]
# Note do not use or/and for boolean comparisons, you must use the bitwise versions |/& also for negation ~









    Out[38]:






  
    
      Index([], dtype=object)
      Empty DataFrame



In [39]:

    
# we can get a specific value at row and column
sample_df.loc['c', 'col2']









    Out[39]:





-0.088912753762695362



In [40]:

    
# yes fancy reverse works too
sample_df[::-1]









    Out[40]:






  
    
      
      col1
      col2
      new_col
      diff
      date
      day
    
  
  
    
      e
      -0.803649
       0.000000
      -0.754853
       0.000000
      2013-09-20 15:20:05.470000
       20
    
    
      d
      -0.207128
      -0.738760
      -1.506391
       0.531632
      2013-09-20 15:20:05.470000
       20
    
    
      c
       1.641423
      -0.088913
       1.158524
       1.730336
      2013-09-20 15:20:05.470000
       20
    
    
      b
       0.585088
       0.678610
      -0.424923
      -0.093522
      2013-09-20 15:20:05.470000
       20
    
    
      a
       0.470946
       0.017838
      -0.551610
       0.453109
      2013-09-20 15:20:05.470000
       20



In [41]:

    
# we can perform some simple stats
sample_df.describe()



In [42]:

    
# do some aggregation like sum, count, cumsum and more
sample_df.sum()









    Out[42]:





col1         1.686681
col2        -0.131225
new_col     -2.079252
diff         2.621555
day        100.000000
dtype: float64



In [43]:

    
# plot values
sample_df[['col1', 'col2']].plot()









    Out[43]:





<matplotlib.axes.AxesSubplot at 0x88d5d30>

OK. let's load some data

Pandas supports loading data from csv, html (will load from url, detect HTML tables and try to parse),

json, excel, sql, text files, clipboard memory, hdf and pickles too.



In [44]:

    
names = read_csv(r'c:\pandas-exercises-master\baby-names2.csv')
names.head()



In [46]:

    
names.dtypes









    Out[46]:





year         int64
name        object
prop       float64
sex         object
soundex     object
dtype: object



In [47]:

    
names.tail()



In [48]:

    
names.dtypes









    Out[48]:





year         int64
name        object
prop       float64
sex         object
soundex     object
dtype: object



In [49]:

    
names.columns









    Out[49]:





Index([u'year', u'name', u'prop', u'sex', u'soundex'], dtype=object)



In [50]:

    
# names are ordered by popularity, for year 1880 the top names were
names[names['year']==1880].head()



In [51]:

    
# and the least popular
names[names['year']==1880].tail()



In [52]:

    
# separate the boys from the girls
boys = names[names['sex']=='boy']
girls = names[names['sex']=='girl']
girls









    Out[52]:




<class 'pandas.core.frame.DataFrame'>
Int64Index: 129000 entries, 129000 to 257999
Data columns (total 5 columns):
year       129000  non-null values
name       129000  non-null values
prop       129000  non-null values
sex        129000  non-null values
soundex    129000  non-null values
dtypes: float64(1), int64(1), object(3)



In [53]:

    
# perform a groupby operation
boys.groupby('year')
# this has created a groupby object, it has not performed any aggregation or grouping yet









    Out[53]:





<pandas.core.groupby.DataFrameGroupBy object at 0x000000000A578978>



In [54]:

    
boys.groupby('year').size()
# shows we have 1000 entries for each year









    Out[54]:





year
1880    1000
1881    1000
1882    1000
1883    1000
1884    1000
1885    1000
1886    1000
1887    1000
1888    1000
1889    1000
1890    1000
1891    1000
1892    1000
1893    1000
1894    1000
...
1994    1000
1995    1000
1996    1000
1997    1000
1998    1000
1999    1000
2000    1000
2001    1000
2002    1000
2003    1000
2004    1000
2005    1000
2006    1000
2007    1000
2008    1000
Length: 129, dtype: int64



In [55]:

    
# this is the same for both boys and girls
names.groupby(['year','sex']).size()









    Out[55]:





year  sex 
1880  boy     1000
      girl    1000
1881  boy     1000
      girl    1000
1882  boy     1000
      girl    1000
1883  boy     1000
      girl    1000
1884  boy     1000
      girl    1000
1885  boy     1000
      girl    1000
1886  boy     1000
      girl    1000
1887  boy     1000
...
2001  girl    1000
2002  boy     1000
      girl    1000
2003  boy     1000
      girl    1000
2004  boy     1000
      girl    1000
2005  boy     1000
      girl    1000
2006  boy     1000
      girl    1000
2007  boy     1000
      girl    1000
2008  boy     1000
      girl    1000
Length: 258, dtype: int64



In [56]:

    
# we can index this object too, this selects top names for year 2001,2002
names.groupby(['year','sex']).size().ix[2000]









    Out[56]:





sex
boy     1000
girl    1000
dtype: int64



In [57]:

    
# so lets filter so we can see the most popular names by year
boys[boys.year == 2000] # note that columns are also attributes









    Out[57]:




<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000 entries, 120000 to 120999
Data columns (total 5 columns):
year       1000  non-null values
name       1000  non-null values
prop       1000  non-null values
sex        1000  non-null values
soundex    1000  non-null values
dtypes: float64(1), int64(1), object(3)



In [58]:

    
# display the proportion
boys[boys.year==2000].prop









    Out[58]:





120000    0.016514
120001    0.015346
120002    0.013695
120003    0.013194
120004    0.011990
120005    0.011815
120006    0.011324
120007    0.010930
120008    0.010681
120009    0.010305
120010    0.009896
120011    0.009745
120012    0.009715
120013    0.009617
120014    0.009516
...
120985    0.000072
120986    0.000072
120987    0.000072
120988    0.000072
120989    0.000071
120990    0.000071
120991    0.000071
120992    0.000071
120993    0.000071
120994    0.000071
120995    0.000071
120996    0.000071
120997    0.000071
120998    0.000071
120999    0.000071
Name: prop, Length: 1000, dtype: float64



In [59]:

    
# display the top 5
boys[boys.year == 2000][:5]









    Out[59]:






  
    
      
      year
      name
      prop
      sex
      soundex
    
  
  
    
      120000
       2000
             Jacob
       0.016514
       boy
       J210
    
    
      120001
       2000
           Michael
       0.015346
       boy
       M240
    
    
      120002
       2000
           Matthew
       0.013695
       boy
       M300
    
    
      120003
       2000
            Joshua
       0.013194
       boy
       J200
    
    
      120004
       2000
       Christopher
       0.011990
       boy
       C623



In [60]:

    
# we can get the index value for the top proportion value
boys[boys.year == 2000].prop.idxmax()









    Out[60]:





120000



In [61]:

    
# this can be used for indexing
boys.ix[boys[boys.year == 2000].prop.idxmax()]  # to get the whole row.









    Out[61]:





year           2000
name          Jacob
prop       0.016514
sex             boy
soundex        J210
Name: 120000, dtype: object



In [62]:

    
# using this we can now write a function
def get_max_record(group):
    return group.ix[group.prop.idxmax()]
get_max_record(boys)









    Out[62]:





year           1880
name           John
prop       0.081541
sex             boy
soundex        J500
Name: 0, dtype: object



In [63]:

    
# do this for each year
result = boys.groupby('year').apply(get_max_record)



In [64]:

    
set_option('display.max_rows', 500)
result



In [65]:

    
# plot this, this shows that the proportion required to be the most popular name has fallen generally, there are some notable peaks and troughs though.
# can you think why?
result.prop.plot()









    Out[65]:





<matplotlib.axes.AxesSubplot at 0x885c2e8>



In [66]:

    
# lets select a specific name
boys[boys.name == 'Travis']



In [67]:

    
# we can set a mutli index on the data frame and use this
idf = boys.set_index(['name', 'year'])



In [68]:

    
# select the bottom 50
idf[-50:]



In [69]:

    
# where's travis?
idf.ix['Travis']



In [70]:

    
# plot his popularity
idf.ix['Travis'].prop.plot()









    Out[70]:





<matplotlib.axes.AxesSubplot at 0x875cac8>



In [72]:

    
# calc the mean proportion
boys.groupby('name')['prop'].mean()









    Out[72]:





name
Aaden       0.000442
Aarav       0.000101
Aaron       0.002266
Ab          0.000044
Abb         0.000046
Abbie       0.000046
Abbott      0.000044
Abdiel      0.000092
Abdul       0.000064
Abdullah    0.000094
Abe         0.000225
Abel        0.000152
Abelardo    0.000041
Abie        0.000042
Abner       0.000130
...
Zander       0.000369
Zane         0.000175
Zavier       0.000098
Zayden       0.000185
Zayne        0.000091
Zeb          0.000080
Zebulon      0.000055
Zechariah    0.000112
Zed          0.000056
Zeke         0.000055
Zenas        0.000052
Zeno         0.000071
Zigmund      0.000052
Zion         0.000490
Zollie       0.000060
Name: prop, Length: 3437, dtype: float64






    Out[72]:





name
Aaden       0.000442
Aarav       0.000101
Aaron       0.002266
Ab          0.000044
Abb         0.000046
Abbie       0.000046
Abbott      0.000044
Abdiel      0.000092
Abdul       0.000064
Abdullah    0.000094
Abe         0.000225
Abel        0.000152
Abelardo    0.000041
Abie        0.000042
Abner       0.000130
...
Zander       0.000369
Zane         0.000175
Zavier       0.000098
Zayden       0.000185
Zayne        0.000091
Zeb          0.000080
Zebulon      0.000055
Zechariah    0.000112
Zed          0.000056
Zeke         0.000055
Zenas        0.000052
Zeno         0.000071
Zigmund      0.000052
Zion         0.000490
Zollie       0.000060
Name: prop, Length: 3437, dtype: float64



In [73]:

    
# sort it
boys.groupby('name')['prop'].mean().order()









    Out[73]:





name
Danniel     0.000027
Kennard     0.000027
Deryl       0.000028
Grayling    0.000028
Michial     0.000028
Cheryl      0.000028
Dwyane      0.000028
Levern      0.000029
Diane       0.000029
Randolf     0.000029
Theodis     0.000029
Kimball     0.000029
Renard      0.000029
Patric      0.000029
Lex         0.000030
...
Paul       0.008322
Daniel     0.008517
Frank      0.010386
Edward     0.011069
Richard    0.014183
Thomas     0.014739
George     0.016254
David      0.016737
Joseph     0.017771
Michael    0.018342
Charles    0.019521
Robert     0.029625
William    0.034182
James      0.035465
John       0.041082
Name: prop, Length: 3437, dtype: float64



In [74]:

    
# get some stats
boys['prop'].describe()









    Out[74]:





count    129000.000000
mean          0.000916
std           0.003441
min           0.000026
25%           0.000071
50%           0.000142
75%           0.000453
max           0.081541
dtype: float64



In [75]:

    
# groupby by year, call describe on the prop column
result = boys.groupby('year')['prop'].describe()



In [76]:

    
# display the first 50
result[:50]









    Out[76]:





year       
1880  count    1000.000000
      mean        0.000931
      std         0.004871
      min         0.000042
      25%         0.000059
      50%         0.000118
      75%         0.000372
      max         0.081541
1881  count    1000.000000
      mean        0.000930
      std         0.004812
      min         0.000037
      25%         0.000065
      50%         0.000120
      75%         0.000369
      max         0.080975
1882  count    1000.000000
      mean        0.000928
      std         0.004691
      min         0.000041
      25%         0.000066
      50%         0.000115
      75%         0.000377
      max         0.078314
1883  count    1000.000000
      mean        0.000929
      std         0.004675
      min         0.000044
      25%         0.000062
      50%         0.000116
      75%         0.000382
      max         0.079066
1884  count    1000.000000
      mean        0.000927
      std         0.004542
      min         0.000041
      25%         0.000065
      50%         0.000122
      75%         0.000385
      max         0.076476
1885  count    1000.000000
      mean        0.000926
      std         0.004459
      min         0.000043
      25%         0.000069
      50%         0.000121
      75%         0.000373
      max         0.075517
1886  count    1000.000000
      mean        0.000926
dtype: float64



In [77]:

    
# lets look at 2008
df = boys[boys.year == 2008]
df.prop









    Out[77]:





128000    0.010355
128001    0.009437
128002    0.009301
128003    0.008799
128004    0.008702
128005    0.008566
128006    0.008442
128007    0.008438
128008    0.008268
128009    0.008061
128010    0.007877
128011    0.007694
128012    0.007570
128013    0.007467
128014    0.007262
...
128985    0.000092
128986    0.000091
128987    0.000091
128988    0.000091
128989    0.000091
128990    0.000090
128991    0.000090
128992    0.000090
128993    0.000090
128994    0.000090
128995    0.000090
128996    0.000089
128997    0.000089
128998    0.000089
128999    0.000089
Name: prop, Length: 1000, dtype: float64



In [78]:

    
df = boys[boys.year == 2008].sort_index(by='prop', ascending=False)   # If not in descending order. Can also do ascending=True for ascending.
df.prop









    Out[78]:





128000    0.010355
128001    0.009437
128002    0.009301
128003    0.008799
128004    0.008702
128005    0.008566
128006    0.008442
128007    0.008438
128008    0.008268
128009    0.008061
128010    0.007877
128011    0.007694
128012    0.007570
128013    0.007467
128014    0.007262
...
128985    0.000092
128986    0.000091
128987    0.000091
128988    0.000091
128989    0.000091
128992    0.000090
128994    0.000090
128993    0.000090
128995    0.000090
128991    0.000090
128990    0.000090
128997    0.000089
128998    0.000089
128996    0.000089
128999    0.000089
Name: prop, Length: 1000, dtype: float64



In [79]:

    
# perform a cumulative summation
df.prop.cumsum()   # numpy operation









    Out[79]:





128000    0.010355
128001    0.019792
128002    0.029093
128003    0.037892
128004    0.046594
128005    0.055160
128006    0.063602
128007    0.072040
128008    0.080308
128009    0.088369
128010    0.096246
128011    0.103940
128012    0.111510
128013    0.118977
128014    0.126239
...
128985    0.794154
128986    0.794245
128987    0.794336
128988    0.794427
128989    0.794518
128992    0.794608
128994    0.794698
128993    0.794788
128995    0.794878
128991    0.794968
128990    0.795058
128997    0.795147
128998    0.795236
128996    0.795325
128999    0.795414
Length: 1000, dtype: float64



In [80]:

    
# how many does it take to reach 50%. Also called a measure of diversity.
df.prop.cumsum().searchsorted(0.5)









    Out[80]:





127



In [81]:

    
# plot the rank where 50% proportion is
def get_quantile_count(group, quantile = 0.5):
    df = group.sort_index(by='prop', ascending=False)
    return df.prop.cumsum().searchsorted(quantile)

boys.groupby('year').apply(get_quantile_count).plot()









    Out[81]:





<matplotlib.axes.AxesSubplot at 0xa7ab3c8>



In [82]:

    
# do this for boys and girls
def get_quantile_count(group, quantile=0.5):     # Problem with no different colors for boys and girls.
    group = group.groupby('soundex').sum()
    df = group.sort_index(by='prop', ascending=False)
    return df.prop.cumsum().searchsorted(quantile)

#f = lambda x: get_quantile_count(x, 0.1)
q = 0.5
boy_ct = boys.groupby('year').apply(get_quantile_count, quantile=q)   # to pass different values for quantile
girl_ct = girls.groupby('year').apply(get_quantile_count, quantile=q)
boy_ct.plot(label='boy')
girl_ct.plot(label='girl')
legend(loc='best')    # with --pylab=inline, we don't have to do plt.legend()









    Out[82]:





<matplotlib.legend.Legend at 0xad1ae80>



In [83]:

    
# display the ranking
boys[boys.year == 2008].prop.rank()    # mean rank by default.









    Out[83]:





128000    1000
128001     999
128002     998
128003     997
128004     996
128005     995
128006     994
128007     993
128008     992
128009     991
128010     990
128011     989
128012     988
128013     987
128014     986
...
128985    16.0
128986    12.5
128987    12.5
128988    12.5
128989    12.5
128990     7.5
128991     7.5
128992     7.5
128993     7.5
128994     7.5
128995     7.5
128996     2.5
128997     2.5
128998     2.5
128999     2.5
Name: prop, Length: 1000, dtype: float64



In [84]:

    
# create group object
grouped = boys.groupby('year')['prop']



In [85]:

    
grouped.transform(Series.rank)   # transform is more rigid than apply. Output the same size as the input.









    Out[85]:





0     1000
1      999
2      998
3      997
4      996
5      995
6      994
7      993
8      992
9      991
10     990
11     989
12     988
13     987
14     986
...
128985    16.0
128986    12.5
128987    12.5
128988    12.5
128989    12.5
128990     7.5
128991     7.5
128992     7.5
128993     7.5
128994     7.5
128995     7.5
128996     2.5
128997     2.5
128998     2.5
128999     2.5
Name: prop, Length: 129000, dtype: float64



In [86]:

    
# add this as a column
boys['year_rank'] = grouped.transform(Series.rank)



In [87]:

    
# how popular is Alan
boys[boys.name == 'Alan'].year_rank.plot()









    Out[87]:





<matplotlib.axes.AxesSubplot at 0xa78e160>



In [89]:

    
# Other names
boys[boys.name == 'Michael'].year_rank.plot()









    Out[89]:





<matplotlib.axes.AxesSubplot at 0xc6e8978>



In [90]:

    
# lets load some birth data
births = read_csv(r'c:\pandas-exercises-master\births.csv')



In [91]:

    
merged = merge(names, births, on=['year', 'sex'])   # merge 2 tables: names and births., inner by default



In [92]:

    
# calc the number of people born with a given name by multiplying proportion by total number of births
merged['persons'] = np.floor(merged.prop * merged.births)



In [94]:

    
merged.head()



In [95]:

    
merged.groupby(['name', 'sex'])['persons'].sum()   # slice and dice. It's a hierarhical labeling.









    Out[95]:





name      sex 
Aaden     boy        959
Aaliyah   girl     39660
Aarav     boy        219
Aaron     boy     508034
          girl      1365
Ab        boy         24
Abagail   girl      2680
Abb       boy         15
Abbey     girl     14333
Abbie     boy         10
          girl     16556
Abbigail  girl      6792
Abbott    boy          9
Abby      girl     47845
Abdiel    boy        199
...
Zilpah  girl        8
Zilpha  girl       70
Zina    girl     2263
Zion    boy     11428
        girl     1530
Zita    girl      831
Zoa     girl       73
Zoe     girl    66271
Zoey    girl    19335
Zoie    girl     4688
Zola    girl     4847
Zollie  boy        60
Zona    girl     3218
Zora    girl     4551
Zula    girl     3578
Name: persons, Length: 7455, dtype: float64



In [96]:

    
merged.groupby(['name', 'sex'])['persons'].sum().order()









    Out[96]:





name      sex 
Achsah    girl    3
Ala       girl    3
Alabama   girl    3
Alfred    girl    3
Amey      girl    3
Chester   girl    3
Chestina  girl    3
Clem      girl    3
Crissie   girl    3
Daisye    girl    3
Dema      girl    3
Dollye    girl    3
Eithel    girl    3
Abie      boy     4
Acey      boy     4
...
Elizabeth    girl    1540430
Patricia     girl    1567602
Daniel       boy     1776483
Christopher  boy     1924919
Thomas       boy     2239606
Charles      boy     2320818
Joseph       boy     2490274
Richard      boy     2535681
David        boy     3495940
William      boy     3951365
Mary         girl    4097626
Michael      boy     4207352
Robert       boy     4752198
John         boy     5016124
James        boy     5021269
Name: persons, Length: 7455, dtype: float64



In [97]:

    
mboys = merge(boys, births)  # inner join by default.



In [98]:

    
# calc the number of boys born with a given name
mboys['persons'] = np.floor(mboys.prop * mboys.births)



In [99]:

    
# Select out persons
persons = mboys.set_index(['year', 'name']).persons



In [100]:

    
persons  # hierarhical index









    Out[100]:





year  name   
1880  John       9654
      William    9532
      James      5926
      Charles    5347
      George     5125
      Frank      3241
      Joseph     2632
      Thomas     2533
      Henry      2443
      Robert     2415
      Edward     2363
      Harry      2152
      Walter     1754
      Arthur     1598
      Fred       1568
...
2008  Eliezer      199
      Bronson      197
      Cornelius    197
      Pierre       197
      Cortez       197
      Baron        195
      Carlo        195
      Carsen       195
      Fletcher     195
      Izayah       195
      Kolten       195
      Damari       193
      Hugh         193
      Jensen       193
      Yurem        193
Name: persons, Length: 129000, dtype: float64



In [101]:

    
# Select out all the people named Chris. Plot is kind of crowded. Matplotlib doesn't go more than 130 in x axis.
persons.ix[:, 'Christopher'].plot(kind='bar', rot=90)









    Out[101]:





<matplotlib.axes.AxesSubplot at 0xc70d2b0>



In [102]:

    
# what about me?
persons.ix[:, 'Alan'].plot(kind='bar', rot=90)









    Out[102]:





<matplotlib.axes.AxesSubplot at 0xf54f7f0>



In [103]:

    
persons.unstack('name')   # Create a data frame whose columns are each unique names, and the row indexes are the years.









    Out[103]:




<class 'pandas.core.frame.DataFrame'>
Int64Index: 129 entries, 1880 to 2008
Columns: 3437 entries, Aaden to Zollie
dtypes: float64(3437)



In [104]:

    
result = _   # underscore in ipython: the output of the last statement, because we don't want to compute again the same thing.



In [105]:

    
result['Alan']









    Out[105]:





year
1880     NaN
1881     NaN
1882       5
1883     NaN
1884       8
1885      11
1886       4
1887       5
1888       5
1889       9
1890      11
1891       8
1892      16
1893      11
1894      10
1895      21
1896      13
1897      19
1898      14
1899      17
1900      26
1901      19
1902      26
1903      24
1904      28
1905      23
1906      27
1907      38
1908      30
1909      44
1910      59
1911      55
1912     102
1913     158
1914     185
1915     236
1916     251
1917     309
1918     332
1919     331
1920     431
1921     375
1922     479
1923     588
1924     623
1925     803
1926     836
1927     810
1928     963
1929    1100
1930    1304
1931    1418
1932    1339
1933    1341
1934    1367
1935    1424
1936    1756
1937    1840
1938    2240
1939    2331
1940    2377
1941    2624
1942    3136
1943    3855
1944    4278
1945    5077
1946    6482
1947    7813
1948    7391
1949    7233
1950    7737
1951    8382
1952    8314
1953    8433
1954    8705
1955    9027
1956    8720
1957    8213
1958    8221
1959    8150
1960    8357
1961    8616
1962    7702
1963    6938
1964    6807
1965    5617
1966    4943
1967    4566
1968    3986
1969    3862
1970    3554
1971    3167
1972    2576
1973    2268
1974    2229
1975    2249
1976    2188
1977    2131
1978    2116
1979    2321
1980    2665
1981    2583
1982    2678
1983    2767
1984    2698
1985    2683
1986    2640
1987    2583
1988    2751
1989    2723
1990    2640
1991    2541
1992    2602
1993    2559
1994    2433
1995    2291
1996    2442
1997    2175
1998    2137
1999    2246
2000    2413
2001    2642
2002    2618
2003    3094
2004    3062
2005    3204
2006    3467
2007    3244
2008    3020
Name: Alan, Length: 129, dtype: float64



In [106]:

    
result['Alan'].plot()









    Out[106]:





<matplotlib.axes.AxesSubplot at 0xf53b6d8>

What else is Pandas useful for?

As Pandas uses Numpy underneath it can be used as the data structure for many Numpy compatible libraries

Additionally since you can convert the values to a list, dict, json etc.. it is compatible with pretty much everything



In [107]:

    
# e.g.
sample_df['col2'].tolist()









    Out[107]:





[0.017837544444303886,
 0.6786099870029839,
 -0.08891275376269536,
 -0.7387600206972956,
 0.0]



In [108]:

    
# dataframes become a list of lists
sample_df[['col1','col2']].values.tolist()









    Out[108]:





[[0.4709464627097036, 0.017837544444303886],
 [0.5850881841278334, 0.6786099870029839],
 [1.641423457204322, -0.08891275376269536],
 [-0.2071279112214586, -0.7387600206972956],
 [-0.8036489512527647, 0.0]]



In [109]:

    
# for dictionaries key values are columns, values are dicts with key values as row index value: row value
sample_df.to_dict()









    Out[109]:





{'col1': {'a': 0.47094646270970358,
  'b': 0.58508818412783337,
  'c': 1.641423457204322,
  'd': -0.2071279112214586,
  'e': -0.8036489512527647},
 'col2': {'a': 0.017837544444303886,
  'b': 0.67860998700298392,
  'c': -0.088912753762695362,
  'd': -0.73876002069729563,
  'e': 0.0},
 'date': {'a': Timestamp('2013-09-20 15:20:05.470000', tz=None),
  'b': Timestamp('2013-09-20 15:20:05.470000', tz=None),
  'c': Timestamp('2013-09-20 15:20:05.470000', tz=None),
  'd': Timestamp('2013-09-20 15:20:05.470000', tz=None),
  'e': Timestamp('2013-09-20 15:20:05.470000', tz=None)},
 'day': {'a': 20, 'b': 20, 'c': 20, 'd': 20, 'e': 20},
 'diff': {'a': 0.45310891826539967,
  'b': -0.09352180287515055,
  'c': 1.7303362109670175,
  'd': 0.53163210947583706,
  'e': 0.0},
 'new_col': {'a': -0.55161032907010144,
  'b': -0.42492282964636496,
  'c': 1.158524297399135,
  'd': -1.5063906936060718,
  'e': -0.75485284391010221}}

A quick funky d3.js visualisation example

Some background about De Bruijn Graph http://en.wikipedia.org/wiki/De_Bruijn_graph

Set-up networkx for visualizing the local structure of such network.



In [110]:

    
import networkx as nx
from IPython.core.display import display_javascript
#from IPython.frontend.html.notebook import visutils as vis
import json
import time

Load D3.js



In [111]:

    
%install_ext https://raw.github.com/cschin/ipython_d3_mashup/master/extension/visutils.py
%reload_ext visutils
vis.run_js("$.getScript('http://d3js.org/d3.v2.js')")
vis.run_js("$.getScript('https://raw.github.com/cschin/ipython_d3_mashup/master/extension/vis_extension.js')")
time.sleep(2)
vis.run_js("IPython.vis_init();")









    



Installed visutils.py. To use it, type:
  %load_ext visutils

Set up the visulization "cell"/widget.



In [112]:

    
try:
    vis_display.remove()
except:
    pass
plot_area_style = {"position":"absolute",
                   "top":"0px",
                   "width":"850px",
                   "left":"750px",
                   "height":"350px",
                   "border":"9px groove",
                   "background-color":"rgba(200,200,200,0.5)"}

vis_cell = vis.VISCellWidget(name="plot_area", style = plot_area_style)

## attache the container to a "visual display"
vis_display = vis.NotebookVisualDisplay(container = vis_cell)


## create the SVG element for D3
svg_style = {"width":"850px", 
             "height":"300px",
             "border":"2px solid"}

svg = vis.SVGWidget(name = "svg_display", 
                   parent = "plot_area", 
                   style = svg_style,
                   vis = vis_display)



In [113]:

    
def replace_nodes(G0, Ns, N0):
    G0.add_node(N0, X="X")

    head = Ns[0]
    tail = Ns[-1]
    
    p = G0.predecessors(Ns[0])
    if len(p) == 1:
        p = p[0]
        G0.add_edge(p, N0)
        
    n = G0.successors(Ns[-1])
    if len(n) == 1:
        n = n[0]
        G0.add_edge(N0, n)
    
    G0.remove_nodes_from(Ns)

def reduce_graph(G0):
    G1 = G0.copy()
    G2 = G0.copy()
    nodes_to_remove = []
    for n in G1.nodes():
        if len(G1.successors(n)) > 1 or len(G1.predecessors(n)) > 1:
            if n[0] != "^" and n[-1] != "$":
                nodes_to_remove.append(n)
    G1.remove_nodes_from(nodes_to_remove)

    for ns in nx.weakly_connected_components(G1):
        ns = [n for n in ns if n[0] != "^" and n[-1] != "$"]
        if len(ns) == 0: continue
        contig = []
        n_sorted = nx.topological_sort(G1, ns)
        n_sorted = [n for n in n_sorted if n[0] != "^" and n[-1] != "$"]
        if len(n_sorted) <= 1: continue

        for kmer in n_sorted:
            assert len(G1.successors(kmer)) <= 1
            if len(contig) == 0:
                contig.append(kmer)
            else:
                contig.append(kmer[-1])
        replace_nodes(G2, n_sorted, "".join(contig))
        
    return G2

Generate the json for d3.js for the force layout.



In [114]:

    
def get_group(w, w1):
    c = 0
    for c1, c2 in zip(w,w1):
        if c1 == c2:
            c += 1
            continue
        break
    return c

def set_g_json(seq, k, reduce_g = False):
    G=nx.DiGraph()
    seq = "^"+seq+"$"
    for i in range(len(seq)-k+1):
        w1 = seq[i:i+k-1]
        w2 = seq[i+1:i+k]
        G.add_edge(w1, w2)
    if reduce_g == True:
        G = reduce_graph(G)
    def generateD3JSONForG(G):
        s = {"nodes":[], "links":[]}
        name2Idx = {}
        c = 0
        for n in G.nodes():
            #print n
            g = len(G.neighbors(n))
            if "^" in n:
                s["nodes"].append({"name":n, "group":g, "fixed":True, "x":0,"y":150})
            elif "$" in n:
                s["nodes"].append({"name":n, "group":g, "fixed":True, "x":850,"y":150})
            else:
                s["nodes"].append({"name":n, "group":g})
            name2Idx[n] = c
            c += 1
        for e in G.edges():
            col = "rgb(0,0,255)"
            width = 1
            s["links"].append({"source":name2Idx[e[0]], "target":name2Idx[e[1]], "color":col, "width":width})
        return json.dumps(s)     
    n_json = generateD3JSONForG(G)
    vis_cell.set_js_var("n_json", n_json)

Set up d3.js code for showing up the network around the word START.



In [115]:

    
vis_display.js_code=[]

set_g_json("ACGTACGTTGTGCAGTAGTAGTAGT",5)

js = """

(function() {

var plot_neighbor=function(json) {
  var w = 850,
      h = 300,
      fill = d3.scale.category10();
  var vis = d3.select("#plot_area #svg_display")
  var force = d3.layout.force()
      .charge(-40)
      .linkDistance(2)
      .nodes(json.nodes)
      .links(json.links)
      .size([w, h])
      .linkStrength(0.1)
      .start();

  var link = vis.selectAll("line.link")
      .data(json.links)
      .enter().append("svg:line")
      .attr("class", "link")
      .style("stroke-width", function(d) { return d.width; })
      .style("stroke", function(d) { return d.color; })
      .attr("x1", function(d) { return d.source.x; })
      .attr("y1", function(d) { return d.source.y; })
      .attr("x2", function(d) { return d.target.x; })
      .attr("y2", function(d) { return d.target.y; });

  var node = vis.selectAll("circle.node")
      .data(json.nodes)
      .enter().append("svg:circle")
      .attr("class", "node")
      .attr("cx", function(d) { return d.x; })
      .attr("cy", function(d) { return d.y; })
      .attr("r", 4)
      .style("fill", function(d) { return fill(d.group); })
      .call(force.drag);

  node.append("svg:title")
      .text(function(d) { return d.name; });

  vis.style("opacity", 1e-6)
    .transition()
      .duration(1000)
      .style("opacity", 1);


// Per-type markers, as they don't inherit styles.
vis.append("svg:defs").selectAll("marker")
    .data(["arrow"])
    .enter().append("svg:marker")
    .attr("id", String)
    .attr("viewBox", "0 -5 10 10")
    .attr("refX", 15)
    .attr("refY", -1.5)
    .attr("markerWidth", 6)
    .attr("markerHeight", 6)
    .attr("orient", "auto")
    .append("svg:path")
    .attr("d", "M0,-5L10,0L0,5");
    

var path = vis.append("svg:g").selectAll("path")
    .data(force.links())
  .enter().append("svg:path")
    .attr("class", function(d) { return "link arrow"; })
    .attr("marker-end", function(d) { return "url(#arrow)"; })
    .style("stroke-width", "1.5px");

  force.on("tick", function() {
    link.attr("x1", function(d) { return d.source.x; })
        .attr("y1", function(d) { return d.source.y; })
        .attr("x2", function(d) { return d.target.x; })
        .attr("y2", function(d) { return d.target.y; });

    node.attr("cx", function(d) { return d.x; })
        .attr("cy", function(d) { return d.y; });

    path.attr("d", function(d) {
      var dx = d.target.x - d.source.x,
          dy = d.target.y - d.source.y,
          dr = Math.sqrt(dx * dx + dy * dy);
      return "M" + d.source.x + "," + d.source.y + "L" + d.target.x + "," + d.target.y;
    })

  });};
var vc = IPython.vis_utils.name_to_viscell["plot_area"];
//alert(vc.data);
var n_json=$.parseJSON(vc.data.n_json);
//var n_json = vc.data.n_json;
//alert(vc.data["n_json"]);
plot_neighbor(n_json)})()
"""
vis_display.attach_js_code(js)
vis_display.refresh()



In [116]:

    
vis_display.hide()



In [117]:

    
vis_display.show()



In [118]:

    
def show_neighbors(w, k, reduce_g = False):
    vis.run_js('$("#svg_display *").remove();')
    set_g_json(w, k, reduce_g = reduce_g)
    for jc in vis_display.js_code:
        vis.run_js(jc)
    
    
## create a test input text box
input_style = {"width":"440px"}
tb = vis.InputWidget(name = "input_1",
                     parent = "plot_area",
                     style = input_style,
                     value = "AATTAATTAAGGTTTTAATTATTAATTGTAATTAATTAATTAATACTGAT",
                     vis = vis_display)

def onchange(self, *argv, **kwargv):
    self.update_value()
    
vis.set_action(tb, "onchange", onchange)


## create a input text box for k
input_style = {"width":"40px"}
kb = vis.InputWidget(name = "input_2",
                     parent = "plot_area",
                     style = input_style,
                     value = "5",
                     vis = vis_display)
    
vis.set_action(kb, "onchange", onchange)

button_style = {"width":"120px"}
b = vis.ButtonWidget(name="button", 
                     parent="plot_area", 
                     style=button_style, 
                     text="show graph",
                     vis = vis_display)
b.argv = [tb, kb]
def onclick(self, *argv, **kwargv):
    self.text = argv[0].value
    self.k = int(argv[1].value)
    show_neighbors(self.text, self.k, reduce_g = False)
vis.set_action(b, "onclick", onclick, "argv")
    
    
button_style = {"width":"120px"}
b_r = vis.ButtonWidget(name="button2", 
                     parent="plot_area", 
                     style=button_style, 
                     text="reduce graph",
                     vis = vis_display)
b_r.argv = [tb, kb]
def onclick(self, *argv, **kwargv):
    self.text = argv[0].value
    self.k = int(argv[1].value)
    show_neighbors(self.text, self.k, reduce_g = True)
vis.set_action(b_r, "onclick", onclick, "argv")    
    

button_style = {"width":"120px"}
b2 = vis.ButtonWidget(name="button3", 
                     parent="plot_area", 
                     style=button_style, 
                     text="close",
                     vis = vis_display)
def onclick(self, *argv, **kwargv):
    vis_display.remove()
    
vis.set_action(b2, "onclick", onclick)

vis_display.refresh()



In [ ]:



In [ ]:

More resources (Books):

Python for Data Analysis - written by the main developer of Pandas

Learning IPython for Interactive Computing and Data Visualization - the only published book specifically on IPython

YouTube vids:

Sample IPython notebooks

Matplotlib

Questions??



In [ ]:



In [ ]:

	0
0	0.077678
1	-0.424129
2	-0.144588
3	1.504509
4	1.015034

	A
1	0.258513
c	-1.457385
5	-1.145670
g	0.264279
t	0.869407

	year	name	prop	sex	soundex
257995	2008	Carleigh	0.000128	girl	C642
257996	2008	Iyana	0.000128	girl	I500
257997	2008	Kenley	0.000127	girl	K540
257998	2008	Sloane	0.000127	girl	S450
257999	2008	Elianna	0.000127	girl	E450

	col1	col2
a	0.470946	0.017838
b	0.585088	0.678610
c	1.641423	-0.088913
d	-0.207128	-0.738760
e	-0.803649	NaN

	col1	col2	new_col
a	0.470946	0.017838	-0.551610
b	0.585088	0.678610	-0.424923
c	1.641423	-0.088913	1.158524
d	-0.207128	-0.738760	-1.506391
e	-0.803649	NaN	-0.754853

	col1	col2	new_col	diff	date	day
e	-0.803649	0.000000	-0.754853	0.000000	2013-09-20 15:20:05.470000	20
d	-0.207128	-0.738760	-1.506391	0.531632	2013-09-20 15:20:05.470000	20
c	1.641423	-0.088913	1.158524	1.730336	2013-09-20 15:20:05.470000	20
b	0.585088	0.678610	-0.424923	-0.093522	2013-09-20 15:20:05.470000	20
a	0.470946	0.017838	-0.551610	0.453109	2013-09-20 15:20:05.470000	20

	col1	col2	new_col	diff	day
count	5.000000	5.000000	5.000000	5.000000	5
mean	0.337336	-0.026245	-0.415850	0.524311	20
std	0.918989	0.502757	0.974855	0.727365	0
min	-0.803649	-0.738760	-1.506391	-0.093522	20
25%	-0.207128	-0.088913	-0.754853	0.000000	20
50%	0.470946	0.000000	-0.551610	0.453109	20
75%	0.585088	0.017838	-0.424923	0.531632	20
max	1.641423	0.678610	1.158524	1.730336	20

	year	name	prop	sex	soundex
0	1880	John	0.081541	boy	J500
1	1880	William	0.080511	boy	W450
2	1880	James	0.050057	boy	J520
3	1880	Charles	0.045167	boy	C642
4	1880	George	0.043292	boy	G620

	year	name	prop	sex	soundex
129995	1880	Emaline	0.000041	girl	E545
129996	1880	Ester	0.000041	girl	E236
129997	1880	Eulah	0.000041	girl	E400
129998	1880	Eulalie	0.000041	girl	E440
129999	1880	Euna	0.000041	girl	E500

	year	name	prop	sex	soundex
120000	2000	Jacob	0.016514	boy	J210
120001	2000	Michael	0.015346	boy	M240
120002	2000	Matthew	0.013695	boy	M300
120003	2000	Joshua	0.013194	boy	J200
120004	2000	Christopher	0.011990	boy	C623

	year	name	prop	sex	soundex
563	1880	Travis	0.000101	boy	T612
2648	1882	Travis	0.000082	boy	T612
3672	1883	Travis	0.000080	boy	T612
4656	1884	Travis	0.000081	boy	T612
5602	1885	Travis	0.000095	boy	T612
6598	1886	Travis	0.000092	boy	T612
9871	1889	Travis	0.000059	boy	T612
10733	1890	Travis	0.000075	boy	T612
11696	1891	Travis	0.000082	boy	T612
12593	1892	Travis	0.000099	boy	T612
13831	1893	Travis	0.000066	boy	T612
14849	1894	Travis	0.000064	boy	T612
15735	1895	Travis	0.000079	boy	T612
16774	1896	Travis	0.000077	boy	T612
17645	1897	Travis	0.000098	boy	T612
18991	1898	Travis	0.000053	boy	T612
19658	1899	Travis	0.000095	boy	T612
20442	1900	Travis	0.000166	boy	T612
22552	1902	Travis	0.000121	boy	T612
23778	1903	Travis	0.000077	boy	T612
24510	1904	Travis	0.000137	boy	T612
25505	1905	Travis	0.000147	boy	T612
26587	1906	Travis	0.000118	boy	T612
27535	1907	Travis	0.000132	boy	T612
28512	1908	Travis	0.000150	boy	T612
29451	1909	Travis	0.000181	boy	T612
30510	1910	Travis	0.000149	boy	T612
31467	1911	Travis	0.000166	boy	T612
32448	1912	Travis	0.000171	boy	T612
33413	1913	Travis	0.000203	boy	T612
34388	1914	Travis	0.000220	boy	T612
35426	1915	Travis	0.000182	boy	T612
36404	1916	Travis	0.000204	boy	T612
37393	1917	Travis	0.000215	boy	T612
38381	1918	Travis	0.000233	boy	T612
39384	1919	Travis	0.000228	boy	T612
40385	1920	Travis	0.000218	boy	T612
41383	1921	Travis	0.000219	boy	T612
42374	1922	Travis	0.000231	boy	T612
43385	1923	Travis	0.000209	boy	T612
44382	1924	Travis	0.000210	boy	T612
45388	1925	Travis	0.000208	boy	T612
46404	1926	Travis	0.000191	boy	T612
47364	1927	Travis	0.000220	boy	T612
48371	1928	Travis	0.000214	boy	T612
49338	1929	Travis	0.000246	boy	T612
50360	1930	Travis	0.000225	boy	T612
51325	1931	Travis	0.000266	boy	T612
52329	1932	Travis	0.000260	boy	T612
53311	1933	Travis	0.000290	boy	T612
54281	1934	Travis	0.000344	boy	T612
55292	1935	Travis	0.000324	boy	T612
56295	1936	Travis	0.000314	boy	T612
57300	1937	Travis	0.000293	boy	T612
58290	1938	Travis	0.000312	boy	T612
59295	1939	Travis	0.000312	boy	T612
60294	1940	Travis	0.000299	boy	T612
61309	1941	Travis	0.000278	boy	T612
62319	1942	Travis	0.000256	boy	T612
63347	1943	Travis	0.000216	boy	T612
64330	1944	Travis	0.000240	boy	T612
65342	1945	Travis	0.000225	boy	T612
66318	1946	Travis	0.000246	boy	T612
67337	1947	Travis	0.000218	boy	T612
68347	1948	Travis	0.000211	boy	T612
69349	1949	Travis	0.000207	boy	T612
70343	1950	Travis	0.000211	boy	T612
71353	1951	Travis	0.000197	boy	T612
72356	1952	Travis	0.000184	boy	T612
73355	1953	Travis	0.000189	boy	T612
74370	1954	Travis	0.000176	boy	T612
75380	1955	Travis	0.000171	boy	T612
76384	1956	Travis	0.000177	boy	T612
77409	1957	Travis	0.000164	boy	T612
78371	1958	Travis	0.000201	boy	T612
79346	1959	Travis	0.000233	boy	T612
80327	1960	Travis	0.000256	boy	T612
81323	1961	Travis	0.000271	boy	T612
82301	1962	Travis	0.000312	boy	T612
83257	1963	Travis	0.000430	boy	T612
84234	1964	Travis	0.000507	boy	T612
85221	1965	Travis	0.000568	boy	T612
86187	1966	Travis	0.000718	boy	T612
87165	1967	Travis	0.000902	boy	T612
88140	1968	Travis	0.001162	boy	T612
89112	1969	Travis	0.001567	boy	T612
90068	1970	Travis	0.002270	boy	T612
91060	1971	Travis	0.002762	boy	T612
92056	1972	Travis	0.003068	boy	T612
93052	1973	Travis	0.003637	boy	T612
94053	1974	Travis	0.003496	boy	T612
95050	1975	Travis	0.003674	boy	T612
96045	1976	Travis	0.004188	boy	T612
97043	1977	Travis	0.004259	boy	T612
98045	1978	Travis	0.004072	boy	T612
99035	1979	Travis	0.006169	boy	T612
100036	1980	Travis	0.005808	boy	T612
101039	1981	Travis	0.005435	boy	T612
102038	1982	Travis	0.005525	boy	T612
103040	1983	Travis	0.005627	boy	T612
104041	1984	Travis	0.005461	boy	T612
105045	1985	Travis	0.004725	boy	T612
106048	1986	Travis	0.004416	boy	T612
107041	1987	Travis	0.005335	boy	T612
108040	1988	Travis	0.005711	boy	T612
109040	1989	Travis	0.005500	boy	T612
110041	1990	Travis	0.005325	boy	T612
111043	1991	Travis	0.004786	boy	T612
112047	1992	Travis	0.004239	boy	T612
113049	1993	Travis	0.003845	boy	T612
114052	1994	Travis	0.003338	boy	T612
115065	1995	Travis	0.002788	boy	T612
116072	1996	Travis	0.002576	boy	T612
117080	1997	Travis	0.002346	boy	T612
118097	1998	Travis	0.001937	boy	T612
119101	1999	Travis	0.001770	boy	T612
120114	2000	Travis	0.001576	boy	T612
121133	2001	Travis	0.001424	boy	T612
122143	2002	Travis	0.001306	boy	T612
123158	2003	Travis	0.001209	boy	T612
124165	2004	Travis	0.001164	boy	T612
125173	2005	Travis	0.001114	boy	T612
126162	2006	Travis	0.001186	boy	T612
127169	2007	Travis	0.001087	boy	T612
128176	2008	Travis	0.001025	boy	T612

		prop	sex	soundex
name	year
Reilly	2008	0.000097	boy	R400
Sheldon	2008	0.000097	boy	S435
Abdullah	2008	0.000097	boy	A134
Jagger	2008	0.000097	boy	J260
Thaddeus	2008	0.000097	boy	T320
Case	2008	0.000096	boy	C200
Kyson	2008	0.000096	boy	K250
Lamont	2008	0.000096	boy	L553
Chaz	2008	0.000096	boy	C200
Makhi	2008	0.000096	boy	M200
Jan	2008	0.000095	boy	J500
Marques	2008	0.000095	boy	M622
Oswaldo	2008	0.000095	boy	O243
Donavan	2008	0.000095	boy	D515
Keyon	2008	0.000095	boy	K500
Kyan	2008	0.000095	boy	K500
Simeon	2008	0.000095	boy	S550
Trystan	2008	0.000095	boy	T623
Andreas	2008	0.000094	boy	A536
Dangelo	2008	0.000094	boy	D524
Landin	2008	0.000094	boy	L535
Reagan	2008	0.000094	boy	R250
Turner	2008	0.000094	boy	T656
Arnav	2008	0.000094	boy	A651
Brenton	2008	0.000094	boy	B653
Callum	2008	0.000094	boy	C450
Jayvion	2008	0.000094	boy	J150
Bridger	2008	0.000093	boy	B632
Sammy	2008	0.000093	boy	S500
Deegan	2008	0.000093	boy	D250
Jaylan	2008	0.000093	boy	J450
Lennon	2008	0.000093	boy	L550
Odin	2008	0.000093	boy	O350
Abdiel	2008	0.000092	boy	A134
Jerimiah	2008	0.000092	boy	J650
Eliezer	2008	0.000092	boy	E426
Bronson	2008	0.000091	boy	B652
Cornelius	2008	0.000091	boy	C654
Pierre	2008	0.000091	boy	P600
Cortez	2008	0.000091	boy	C632
Baron	2008	0.000090	boy	B650
Carlo	2008	0.000090	boy	C640
Carsen	2008	0.000090	boy	C625
Fletcher	2008	0.000090	boy	F432
Izayah	2008	0.000090	boy	I200
Kolten	2008	0.000090	boy	K435
Damari	2008	0.000089	boy	D560
Hugh	2008	0.000089	boy	H200
Jensen	2008	0.000089	boy	J525
Yurem	2008	0.000089	boy	Y650